Digging into acceptor splice site prediction: an iterative feature selection approach
Feature selection techniques are often used to reduce data dimensionality, increase classification performance, and gain insight into the processes that generated the data. In this paper, we describe an iterative procedure of feature selection and feature construction steps, improving the classification of acceptor splice sites, an important subtask of gene prediction.
We show that acceptor prediction can benefit from feature selection, and describe how feature selection techniques can be used to gain new insights in the classification of acceptor sites. This is illustrated by the identification of a new, biologically motivated feature: the AG-scanning feature.
The results described in this paper contribute both to the domain of gene prediction and to research in feature selection techniques, describing a new wrapper-based feature weighting method that aids in knowledge discovery when dealing with complex datasets.
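The AG-scanning feature can be read as counting AG dinucleotides in the intronic region upstream of a candidate acceptor site, reflecting the scanning model in which the spliceosome selects the first AG downstream of the branch point. The sketch below is an illustrative assumption (the window size and exact definition are not taken from the paper):

```python
def ag_scanning_feature(sequence, site_index, window=50):
    """Count AG dinucleotides in the window upstream of a candidate
    acceptor site (positions site_index-window .. site_index-1).
    Window size 50 is an assumed default, not the paper's setting."""
    start = max(0, site_index - window)
    upstream = sequence[start:site_index]
    return sum(1 for i in range(len(upstream) - 1)
               if upstream[i:i + 2] == "AG")

# hypothetical intron fragment ending at a candidate acceptor "AG"
seq = "TTAGCCTAGTTTTTAG"
print(ag_scanning_feature(seq, len(seq) - 2))  # → 2
```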
Efficient Feature Selection and Multiclass Classification with Integrated Instance and Model Based Learning
Multiclass classification and feature (variable) selection are commonly encountered in many biological and medical applications. However, extending binary classification approaches to multiclass problems is not trivial. Instance-based methods such as K-nearest neighbors (KNN) extend naturally to multiclass problems and usually perform well with unbalanced data, but they suffer from the curse of dimensionality: their performance degrades on high-dimensional data. Model-based methods such as logistic regression, on the other hand, require decomposing the multiclass problem into several binary problems with one-vs.-one or one-vs.-rest schemes. Although they can be applied to high-dimensional data with L1- or Lp-penalized methods, such approaches can only select features independently, and the features selected for different binary problems are usually different. The one-vs.-rest scheme also produces unbalanced binary classification problems even when the original multiclass problem is balanced.
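To make the contrast concrete, here is a minimal sketch on toy data (not from the paper): a plain KNN vote handles any number of classes directly, while a one-vs.-rest decomposition of a balanced three-class problem produces unbalanced 30-vs.-60 binary subproblems:

```python
from collections import Counter

def knn_predict(X_train, y_train, x, k=3):
    # plain k-nearest-neighbour majority vote: works unchanged
    # for any number of classes
    order = sorted(range(len(X_train)),
                   key=lambda i: sum((a - b) ** 2
                                     for a, b in zip(X_train[i], x)))
    return Counter(y_train[i] for i in order[:k]).most_common(1)[0][0]

X = [(0, 0), (0, 1), (5, 5), (5, 6), (9, 0), (9, 1)]
y = ["A", "A", "B", "B", "C", "C"]
print(knn_predict(X, y, (5, 5.5)))  # → B

# one-vs.-rest turns a balanced 3 x 30 problem into 30-vs.-60 ones
labels = ["A"] * 30 + ["B"] * 30 + ["C"] * 30
for cls in "ABC":
    counts = Counter("pos" if lab == cls else "neg" for lab in labels)
    print(cls, dict(counts))  # each subproblem: 30 pos vs. 60 neg
```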
Large-scale Nonlinear Variable Selection via Kernel Random Features
We propose a new method for input variable selection in nonlinear regression. The method is embedded into a kernel regression machine that can model general nonlinear functions, not being a priori limited to additive models. This is the first kernel-based variable selection method applicable to large datasets. It sidesteps the typical poor scaling properties of kernel methods by mapping the inputs into a relatively low-dimensional space of random features. The algorithm discovers the variables relevant for the regression task together with learning the prediction model, through learning the appropriate nonlinear random feature maps. We demonstrate the outstanding performance of our method on a set of large-scale synthetic and real datasets.
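The scaling trick (a random feature map approximating a kernel) can be sketched with standard random Fourier features. The per-variable scaling vector `s` below stands in for the learned relevance parameters and is an illustrative assumption; this is not the paper's training procedure, just the mechanism it builds on:

```python
import numpy as np

rng = np.random.default_rng(0)

def random_fourier_features(X, W, b):
    # phi(x) = sqrt(2/D) * cos(x W + b) approximates an RBF kernel
    D = W.shape[1]
    return np.sqrt(2.0 / D) * np.cos(X @ W + b)

# toy data: y depends only on the first of 5 inputs
n, d, D = 200, 5, 100
X = rng.normal(size=(n, d))
y = np.sin(X[:, 0])

# per-variable relevance weights s scale each input dimension;
# driving s_j toward zero removes variable j from the learned function
s = np.ones(d)
W = rng.normal(size=(d, D))
b = rng.uniform(0, 2 * np.pi, size=D)

Phi = random_fourier_features(X * s, W, b)     # n x D instead of n x n
# ridge regression in the random feature space (cost grows with D, not n^2)
alpha = np.linalg.solve(Phi.T @ Phi + 1e-3 * np.eye(D), Phi.T @ y)
pred = Phi @ alpha
print(float(np.mean((pred - y) ** 2)))         # small training MSE
```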
Estimation of Relevant Variables on High-Dimensional Biological Patterns Using Iterated Weighted Kernel Functions
BACKGROUND
The analysis of complex proteomic and genomic profiles involves the identification of significant markers within a set of hundreds or even thousands of variables that represent a high-dimensional problem space. The occurrence of noise, redundancy or combinatorial interactions in the profile makes the selection of relevant variables harder.
METHODOLOGY/PRINCIPAL FINDINGS
Here we propose a method to select variables based on estimated relevance to hidden patterns. Our method combines a weighted-kernel discriminant with an iterative stochastic probability estimation algorithm to discover the relevance distribution over the set of variables. We verified the ability of our method to select predefined relevant variables in synthetic proteome-like data and then assessed its performance on biological high-dimensional problems. Experiments were run on serum proteomic datasets of infectious diseases. The resulting variable subsets achieved classification accuracies of 99% on Human African Trypanosomiasis, 91% on Tuberculosis, and 91% on Malaria serum proteomic profiles with fewer than 20% of variables selected. Our method scaled up to dimensionalities several orders of magnitude higher, as shown with gene expression microarray datasets on which we obtained classification accuracies close to 90% with fewer than 1% of the total number of variables.
CONCLUSIONS
Our method consistently found relevant variables attaining high classification accuracies across synthetic and biological datasets. Notably, it yielded very compact subsets compared to the original number of variables, which should simplify downstream biological experimentation.
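A weighted kernel of the kind such methods build on can be illustrated as follows; the weighting scheme here is a generic sketch, not the paper's iterative estimation algorithm:

```python
import numpy as np

def weighted_rbf(x, z, w):
    # RBF kernel with a nonnegative relevance weight per variable:
    # a variable with w_j ≈ 0 stops influencing the distance entirely
    return np.exp(-np.sum(w * (x - z) ** 2))

x = np.array([1.0, 2.0, 3.0])
z = np.array([1.0, 0.0, 3.0])
w = np.array([1.0, 0.0, 1.0])   # variable 2 deemed irrelevant

# with variable 2 masked out, x and z look identical to the kernel
print(weighted_rbf(x, z, w))                       # → 1.0
print(weighted_rbf(x, z, np.ones(3)))              # exp(-4), much smaller
```

An iterative relevance estimator would repeatedly adjust `w` and re-evaluate the discriminant, concentrating weight on the variables that improve class separation.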
Top scoring pairs for feature selection in machine learning and applications to cancer outcome prediction
BACKGROUND
The widely used k top scoring pair (k-TSP) algorithm is a simple yet powerful parameter-free classifier. It owes its success in many cancer microarray datasets to an effective feature selection algorithm that is based on relative expression ordering of gene pairs. However, its general robustness does not extend to some difficult datasets, such as those involving cancer outcome prediction, which may be due to the relatively simple voting scheme used by the classifier. We believe that the performance can be enhanced by separating its effective feature selection component and combining it with a powerful classifier such as the support vector machine (SVM). More generally, the top scoring pairs generated by the k-TSP ranking algorithm can be used as a dimensionally reduced subspace for other machine learning classifiers.
RESULTS
We developed an approach integrating the k-TSP ranking algorithm (TSP) with other machine learning methods, allowing combination of the computationally efficient, multivariate feature ranking of k-TSP with multivariate classifiers such as SVM. We evaluated this hybrid scheme (k-TSP+SVM) in a range of simulated datasets with known data structures. Compared with other feature selection methods, such as a univariate method similar to Fisher's discriminant criterion (Fisher) or recursive feature elimination embedded in SVM (RFE), TSP is increasingly more effective than the other two methods as the informative genes become progressively more correlated, which is demonstrated both in terms of classification performance and the ability to recover true informative genes. We also applied this hybrid scheme to four cancer prognosis datasets, in which k-TSP+SVM outperforms the k-TSP classifier in all datasets and achieves either comparable or superior performance to that of SVM alone. In concurrence with what is observed in simulation, TSP appears to be a better feature selector than Fisher and RFE in some of the cancer datasets.
CONCLUSIONS
The k-TSP ranking algorithm can be used as a computationally efficient, multivariate filter method for feature selection in machine learning. SVM in combination with the k-TSP ranking algorithm outperforms k-TSP and SVM alone in simulated datasets and in some cancer prognosis datasets. Simulation studies suggest that, as a feature selector, TSP is better tuned to certain data characteristics, i.e., correlations among informative genes, which makes it potentially interesting as an alternative feature ranking method in pathway analysis.
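The pair ranking is based on relative expression ordering: for a gene pair (i, j), score how differently the event "gene i is expressed below gene j" behaves in the two classes. A minimal sketch of that score on toy data (tie handling and the secondary ranking criterion used by k-TSP are omitted):

```python
def tsp_score(expr, labels, i, j):
    """Top-scoring-pair score for genes i and j:
    |P(X_i < X_j | class 0) - P(X_i < X_j | class 1)|.
    expr: per-sample expression vectors; labels: 0/1 per sample."""
    def frac(cls):
        samples = [x for x, y in zip(expr, labels) if y == cls]
        return sum(1 for x in samples if x[i] < x[j]) / len(samples)
    return abs(frac(0) - frac(1))

# toy data: the ordering of genes 0 and 1 flips between the classes
expr = [[1, 2], [1, 3], [3, 1], [4, 1]]
labels = [0, 0, 1, 1]
print(tsp_score(expr, labels, 0, 1))  # → 1.0, a perfectly discriminative pair
```

Because only the within-sample ordering matters, the score is invariant to monotone per-sample normalization, which is one reason the ranking travels well across microarray datasets.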
Individualized markers optimize class prediction of microarray data
BACKGROUND
Identification of molecular markers for the classification of microarray data is a challenging task. Despite the evident dissimilarity in various characteristics of biological samples belonging to the same category, most marker-selection and classification methods do not consider this variability. In general, feature selection methods aim at identifying a common set of genes whose combined expression profiles can accurately predict the category of all samples. Here, we argue that this simplified approach is often unable to capture the complexity of a disease phenotype, and we propose an alternative method that takes into account the individuality of each patient sample.
RESULTS
Instead of using the same features for the classification of all samples, the proposed technique starts by creating a pool of informative gene features. For each sample, the method selects a subset of these features whose expression profiles are most likely to accurately predict the sample's category. Different subsets are utilized for different samples, and the outcomes are combined in a hierarchical framework for the classification of all samples. Moreover, this approach can innately identify subgroups of samples within a given class which share common feature sets, thus highlighting the effect of individuality on gene expression.
CONCLUSION
In addition to high classification accuracy, the proposed method offers a more individualized approach for the identification of biological markers, which may help in better understanding the molecular background of a disease and emphasizes the need for more flexible medical interventions.
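One illustrative way to read "different feature subsets for different samples" is the following sketch, which is an assumed instantiation for illustration only, not the authors' algorithm: for each test sample, keep only the k pooled features where that sample deviates most from the between-class midpoint, then classify with the nearest class centroid on that per-sample subset:

```python
import numpy as np

def individualized_predict(x, c0, c1, k=3):
    """Hypothetical per-sample feature selection (not the paper's method):
    keep the k features where this sample is farthest from the midpoint
    of the two class centroids, then do nearest-centroid on that subset."""
    midpoint = (c0 + c1) / 2.0
    idx = np.argsort(np.abs(x - midpoint))[-k:]   # sample-specific subset
    d0 = np.sum((x[idx] - c0[idx]) ** 2)
    d1 = np.sum((x[idx] - c1[idx]) ** 2)
    return 0 if d0 < d1 else 1

c0 = np.zeros(5)                       # centroid of class 0
c1 = np.array([2.0, 2.0, 0, 0, 0])     # centroid of class 1
print(individualized_predict(np.array([2.0, 2.0, 0.1, 0, 0]), c0, c1))  # → 1
print(individualized_predict(np.array([0.0, 0.0, 0.1, 0, 0]), c0, c1))  # → 0
```

Two samples from the same class can end up classified on different feature subsets, which is the individuality the abstract emphasizes.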
Accurate molecular classification of cancer using simple rules
BACKGROUND
One intractable problem with using microarray data analysis for cancer classification is how to reduce the extremely high-dimensional gene feature data to remove the effects of noise. Feature selection is often used to address this problem by selecting informative genes from among thousands or tens of thousands of genes. However, most existing methods of microarray-based cancer classification utilize too many genes to achieve accurate classification, which often hampers the interpretability of the models. For a better understanding of the classification results, it is desirable to develop simpler rule-based models with as few marker genes as possible.
METHODS
We screened a small number of informative single genes and gene pairs on the basis of their depended degrees, a measure proposed in rough set theory. Applying the decision rules induced by the selected genes or gene pairs, we constructed cancer classifiers. We tested the efficacy of the classifiers by leave-one-out cross-validation (LOOCV) of training sets and classification of independent test sets.
RESULTS
We applied our methods to five cancerous gene expression datasets: leukemia (acute lymphoblastic leukemia [ALL] vs. acute myeloid leukemia [AML]), lung cancer, prostate cancer, breast cancer, and leukemia (ALL vs. mixed-lineage leukemia [MLL] vs. AML). Accurate classification outcomes were obtained by utilizing just one or two genes. Some genes that correlated closely with the pathogenesis of the relevant cancers were identified. In terms of both classification performance and algorithm simplicity, our approach outperformed or at least matched existing methods.
CONCLUSION
In cancerous gene expression datasets, a small number of genes, even one or two if selected correctly, is capable of achieving an ideal cancer classification effect. This finding also means that very simple rules may perform well for cancerous class prediction.
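The "one gene, one cut-off" flavour of such rules can be sketched as an exhaustive search for the single-threshold rule with the best training accuracy. This is a generic sketch of rule induction on one gene; the paper's depended-degree criterion from rough set theory is not reproduced here:

```python
def best_single_gene_rule(expr, labels):
    """Search every gene and every candidate cut-off; return the
    (gene, threshold, accuracy) of the best simple rule
    'predict class 1 if expression > threshold'."""
    n_genes = len(expr[0])
    best = (None, None, 0.0)
    for g in range(n_genes):
        for t in sorted(set(x[g] for x in expr)):
            acc = sum((x[g] > t) == bool(y)
                      for x, y in zip(expr, labels)) / len(expr)
            if acc > best[2]:
                best = (g, t, acc)
    return best

# toy two-class data separable on the single gene 0
expr = [[0.1], [0.2], [0.9], [1.0]]
labels = [0, 0, 1, 1]
print(best_single_gene_rule(expr, labels))  # → (0, 0.2, 1.0)
```

A rule this small is trivially interpretable, which is exactly the property the abstract argues for; in practice the threshold should of course be validated out of sample (e.g. by LOOCV as in the paper).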